Delving into gene-set multiplex networks facilitated by a k-nearest neighbor-based measure of similarity

Gene sets are functional units for living cells. Previously, limited studies investigated the complex relations among gene sets, but documents about their altering patterns across biological conditions still need to be prepared. In this study, we adopted and modified a classical k-nearest neighbor-based association function to detect inter-gene-set similarities. Based on this method, we built multiplex networks of gene sets for the first time; these networks contain layers of gene sets corresponding to different populations of cells. The context-based multiplex networks can capture meaningful biological variation and have considerable differences from knowledge-based networks of gene sets built on Jaccard similarity, as demonstrated in this study. Furthermore, at the scale of individual gene sets, the structural coefficients of gene sets (multiplex PageRank centrality, clustering coefficient, and participation coefficient) disclose the diversity of gene sets from the perspective of structural properties and make it easier to identify unique gene sets. In gene set enrichment analysis (GSEA), each gene set is treated independently, and its contextual and relational attributes are ignored. The structural coefficients of gene sets can supplement GSEA with information about the overall picture of gene sets, promoting the constructive reorganization of the enriched terms and helping researchers better prioritize and select gene sets.


Introduction
A gene set, by definition, is a set of gene symbols or other equivalent strings with a specific identifier that usually summarizes its essential information [1].For example, a gene set called "GSE9006_HEALTHY_ VS_TYPE_1_DIABETES_PBMC_AT_DX_DN" from the IMMUNESIGDB database includes 198 gene symbols (e.g., CYP4F3, BASP1, and VNN3).Its naming came from case-control microarray research that found these 198 genes down-regulated in peripheral blood mononuclear cells (PBMCs) of healthy subjects compared to newly diagnosed type I diabetes patients [2,3].The genes in a gene set may have a mixture of discovery sources, such as biological discovery or computational inference [1], and creation of the set may rely on curators who browse online publications about genes and manually select and annotate genes to form a gene set based on their domain knowledge [4].Nowadays, gene-set databases are expanding rapidly, and it becomes more and more important to understand the relationship among gene sets that In addition, we are interested in the intrinsic features of gene sets in their interacting communities under specific biological conditions.A multiplex network is a mathematical framework that aligns with our purpose [13,14].Gene sets can be nodes in the layers of the multiplex network, and the weights of their incident edges can be scalar similarities defined as above.The structural coefficients of a multiplex network integrate the data from all the layers, offering clues to identify characteristic gene sets in the network [15][16][17].Additionally, the findings of network properties of gene sets may help improve the interpretability of gene set enrichment analysis (GSEA) [8,18].GSEA and its variants often emphasize the gene-level statistics or effect size/significance of enrichment [19,20].The contextual and relational attributes of gene sets are often omitted in GSEA [8,21].By delving into the gene set communities, we may have more opportunities to understand the overall picture of gene sets and have extra criteria to select enriched terms in GSEA.

General workflow
A multiplex network consists of a list of regular (monoplex) networks called layers.The layers share the same group of constituent nodes, and the inter-layer relation is simple because a node only connects with its identical node in other layers (for formal definitions, see section 2.6).A real-life example of a multiplex network can be a multiplex social network for a group of people.The same node represents the same person in networks corresponding to different life scenarios, including friendship, hobbies, colleagues, and social media followers.Fig. 1 shows the essential steps in building a multiplex network of gene sets.The analysts can choose and fetch the target collection of gene sets from a database, for example, gene ontology gene sets from org.Hs.eg.db.It also depends on the analysts to determine the data source for computing similarities in each layer of the multiplex network that establishes the biological context for measurement in that layer, which in our experiment is a cell population of a scRNA-seq dataset (section 2.2).Thus, the number of different cell populations equals the number of layers (Fig. 1.A).Other types of samples may work as well.
The Jaccard coefficient is a popular semantic or knowledge-based similarity measure for gene sets.We can modify it to be context-based by adding weights to genes according to their average expression levels in a cell population.Two gene sets are likely dissimilar if their common genes have no expression whereas their private genes dominate, and vice versa.However, the modified Jaccard coefficient cannot capture the total effect of gene expressions on similarities among gene sets because the interactions among genes are crucial and complicated.This is where multivariate association functions may come into play.We designed simulation and running time comparison studies to compare the performance of classical multivariate association functions (for more details, see sections 2.3-2.5).In particular, we focused on a k-nearestneighbor (KNN)-based association function [12,22].Readers can also check the comprehensive review by Josse and Holmes [12].
Once we are using a similarity measure (S KNN in section 2.4) to compute a list of similarity matrices corresponding to multiple biological conditions, we can filter edges with lower-than-threshold weights and instantiate a multiplex network of gene sets and study its structural properties (section 2.6, Fig. 1.B, C).At a high level, the layer-wise relations signify the connections between biological contexts.For illustration purposes, we select cell populations (subtypes of immune cells) that are well-known to researchers.Other than layers, the distributions of subcollections of gene sets, for example, multiplex communities or families of gene ontology gene sets, can also give insights into the behaviors of a group of gene sets of interest.For example, GO:0070161 (anchoring junction) and GO:0005681 (spliceosomal complex) are two parent gene sets in the level-3 gene ontology-cellular component (GO-CC) gene sets.By observing the distribution of their families, including descendant gene sets, we can get acquainted with all the versions of connections between these two families in both the knowledge-based and context-based networks (section 2.7).To quantify the difference in inter-family connections between the Jaccard network and multiplex networks, we introduce two metrics called within-family and betweenfamily fold changes (section 2.7).
At the scale of individual gene sets, the structural coefficients are multifaceted relational attributes of gene sets about their connectivity across layers, which include a multiplex PageRank centrality, a multiplex clustering coefficient, and a multiplex participation coefficient (Fig. 1.C, section 2.7).The multiplex PageRank centrality and clustering coefficients are generalized from their counterparts in the monoplex network by rewarding "consistent" players with high centrality or the clustering tendency in their neighborhoods across layers.The participation coefficient evaluates the uniformity or bias of connections of a gene set over the layers in a multiplex network.Those heterogeneous features reveal the diversity of gene sets from the structural perspective and are helpful in GSEA applications.Gene sets are independently analyzed in GSEA to estimate their member genes' deviation from a random distribution on a target gene list.Researchers frequently use a singular metric of the effect size, for example, a normalized enrichment score (NES), to determine the priority of gene sets for interpretation.We explored the potential of using the structural coefficients to expand the current scope of GSEA by reorganizing the enriched terms and providing more clues on prioritizing gene sets for better exposing terms deserving attention (section 2.8).

Single-cell RNA-seq datasets and gene sets
The pre-processed single-cell RNA-seq datasets used in this study include pbmc3k.finalfetched from SeuratData, which consists of 2638 immune cells collected from the peripheral blood of a healthy human donor [23], and E-MTAB-11536 (EMBL-EBI), which involves 329,762 immune-related human cells from 16 tissues of 12 deceased donors [24].The pre-processing included quality control, gene count normalization, log transformation, variable gene detection, feature scaling, and data integration when applicable.
The collections of GO-CC and GO biological process (GO-BP) gene sets were accessed through org.Hs.eg.db (version 3.14.0) in gseGO [25][26][27][28].Their IDs, descriptions, and symbols of constituent genes were formatted and stored in JSON files for further use [29].ImmuneSigDB (v2022.1/2023.1) is a collection of gene sets stemming from 389 published immunology studies.Researchers manually design comparison experiments for differential gene expression analyses.Upregulated or downregulated genes are packaged as gene sets [3].They can be downloaded from the Molecular Signatures Database published by the Broad Institute [30].

Notation: similarity matrix of gene sets
Constructing a distance matrix (metric or non-metric) or a similarity matrix can provide insights into the relationship among observations not embedded in a common vector space [31].For example, let T = { 1 , ⋯ ,   } be a collection of  gene sets, and given a similarity function  ∶ T × T → ℝ, the  ×  matrix with its -th entry as (  ,   ) is called a similarity matrix.We note that the similarity functions do not have a common definition, and association functions or the inverse of distance/dissimilar functions can work as similarity measures.The Jaccard coefficient is a popular similarity function for gene sets based on their semantic relationships.In contrast, given gene expression profiles of gene sets (e.g., gene count matrices restricted to the genes in each gene set), we can deploy numerical models to calculate the similarity.

Fig. 1.
Essential steps in building a multiplex network of gene sets based on S KNN .(A) Cells from three populations , ,  are collected and sequenced for their transcripts subjected to further preprocessing.For the simplicity of illustration, we assume that the number of samples in the gene expression matrix of each layer is  , and all matrices have the same  detected genes.These constraints are not mandatory in practice.For any gene set  in a cell population, selecting  genes (rows) in the gene set among the  genes of the processed gene expression matrix produces a submatrix that determines the distribution of cells in that feature space.Gene sets with too few or too many overlapped genes or that display excessively sparse expression values in the submatrix are excluded.Different layers may have slightly different collections of gene sets that pass the filtration.We take their intersections to make the filtering collections the same.An implemented KNN algorithm with a chosen  and a metric will take the  ×  submatrix as the input and return a  ×  index matrix that stores the indices of the k-nearest neighbors for each cell.(B) S KNN uses a pair of index matrices corresponding to two gene sets to compute the inter-gene-set similarity (section 2.4).By iterating the computation for all pairs of gene sets in all cell populations, three  ×  gene-set similarity matrices are obtained.(C) It is optional but recommended to set a threshold to truncate entries with low values in the similarity matrices to reduce the effects of random noise.The processed similarity matrices can be treated as weighted adjacency matrices for layers in a multiplex network.An example of the downstream analysis of the multiplex network is computing the structural coefficients of gene sets (section 2.7).

Classical methods for measuring similarity/multivariate association
Jaccard coefficient The Jaccard coefficient is defined as where ,  are two gene sets and | ⋅ | refers to the number of elements in a finite set [32].To account for the effects of gene expression, we propose a modified Jaccard coefficient by weighing genes according to their average expression levels in a scRNA-seq dataset.If the mean expression level of gene  is   , then the modified Jaccard coefficient is The modified RV coefficient, distance correlation coefficient, and Mantel coefficient have closed-form definitions and related tools in Python (Table 1).For the KNN-based association, we gave its definition and used sklearn.neighbors to implement it in Python.
Modified RV coefficient Suppose that gene sets  and  include  and  detected genes, and gene expression submatrices  ∈ ℝ × and  ∈ ℝ × are column-centered, where  is the number of cells in the dataset.Inputs  and  to the  function return an empirical coefficient for their multivariate linear association.The modified RV coefficient (  in Table 1) was devised to eliminate the dependency on sample size by the original RV coefficient [12,33].[12,35].

KNN-based association
Given a submatrix  and a metric (e.g., L1, L2, or cosine distance), a complete graph and its KNN subgraph can be built where in the KNN graph, each node represents a sample and  edges connect between the node and its KNNs (excluding the self-loop).We deployed sklearn.neighbors.NearestNeighbors to find and store the k-nearest neighborhoods for each sample (Fig. 1.A) using the "auto" algorithm and √  , with "L2" (Euclidean distance) as the default  and metric [36].
Suppose that the KNN graphs for two gene sets have been determined as described above.Friedman and Rafsky proposed a KNN-based coefficient denoted as Γ to measure the cumulative overlaps of the samples' neighborhoods, which in turn reflects the overall resemblance of the multivariate features [22].
where A [] , A [] are the adjacency matrices for the KNN graphs built on  and  [22]; and K  (), K  () are the sets of KNNs of sample  (excluding itself) in , .We normalized Γ , by the possible maximum counts of overlapping neighbors to derive S KNN as follows (Fig. 1.B):

Simulation and comparison studies
In the first simulation experiment, two multivariate Gaussian random vectors  ∼ N(,   ),  ∼ N(,   ) in two feature spaces are independent, that is,  ⟂ , and 1000 observations are drawn from the joint distribution of , .We fix  to be 300 and assign  to be 100, 200, ⋯ , 800 (Fig. 2.A, upper left).In the second experiment,  ∼ N(,   ), and  1 =  1 , ⋯ ,   =   ,  ≤  = 300; that is, realizations of  are subvectors of realizations of .The overlap between ,  gradually increases as  = 10, 30, ⋯ , 270 (Fig. 2.A, upper right).In the third experiment,  and  maintain the same relationship, but now  is a random vector of a Gaussian mixture distribution such that | =  ∼ N(  ,   ),  = 1, 2, 3, where   ∈ ℝ  is a random mean vector and  ∈ ℝ × is randomly generated to increase the anisotropy of data in each subpopulation;  specifies the component memberships of observations.Observed data are equally drawn from three subpopulations (Fig. 2.A, lower left).In the fourth experiment,  is the same random vector as in the third experiment, and  =  () =  +  •3 , where ,  ∈ ℝ are scalar constants and • indicates the element-wise Hadamard power. is assigned to be 100, 200, ⋯ , 800 (Fig. 2.A, lower right).Finally, we measure the similarities of 50 random pairs of ImmuneSigDB gene sets.The cells for computing similarities are 1000 sampled classical monocytes in the blood of the E-MTAB-11536 dataset (Fig. 2.B).  ,  (exponent = 1),   , and S KNN with metrics  1 ,  2 ,  and  = 31 ≈ √ 1000 are deployed to measure the associations as described in previous sections.The experiment is repeated 50 times for each given  and .
Meanwhile, we perform a running time comparison of   ,   ,   , and S KNN with the default  and metric that all support parallel computations.The sample size is 100, 200, ⋯ , 800, and the number of gene sets is 200, 400, ⋯ , 1000.The cells are sampled from classical monocytes in the blood of the E-MTAB-11536 dataset, and gene sets are sampled from ImmuneSigDB (v2022.1).We record the similarity matrices computed by the tested methods with different sample sizes or numbers of gene sets and their computing time.The Pearson correlation of two flattened upper triangles (above the main diagonals) of similarity matrices is used to quantify the linear association of two similarity measurements [40].To justify the impact of sample size, given a fixed collection of gene sets, we treat the measurements of   or   under maximal sample size as the reference and compute the Pearson correlations between them and other similarity measurements (Fig. 2.C).The machine used in this part is a desktop workstation (HP Z640) with 10 CPUs (1.70 GHz Xeon) and 64 GB of RAM.Moreover, to explore the impact of hyperparameters ( and metrics) on the performance of S KNN , we fix the sample size to be 1000 and randomly choose 100 Immune-SigDB gene sets, with either one of the metrics ( 2 or ) and different  values (Fig. 2.D, E).

Representation of a multiplex network for gene sets
A multiplex network is a special subtype of a multilayer network.The latter is a triple ) is an ordered list of networks (layers); G  = (  ,   ) is an individual network where   is the set of nodes and   contains edges that can be either directed or undirected and weighted or unweighted; and  is a list of bipartite networks such as  , = (  ,   ,  , ), where ,  ∈ Y,  ≠ .In other words,  depicts the pairwise connections between different networks in ⃗ G [13].
When the nodes in   remain the same for  ∈ Y (so-called replica nodes) and edges between any pair of distinct layers connect exclusively to the identical replica nodes, the multilayer network M is called a multiplex network.Specifically, if M is an unweighted multiplex network, then the || × || matrix A, called the supra-adjacency matrix, can describe its structure compactly as where A [] ,  ∈  is the adjacency matrix for G  [13].
A similarity matrix of gene sets can be treated as an adjacency matrix for a layer if its entries are non-negative, for example,  = S KNN .Furthermore, a preset positive threshold can filter edges with low weights, reducing the unwanted effects of random noise.Given a list of adjacency matrices for gene sets with the edges exclusively linked to identical gene sets, we can implement a multiplex network that satisfies the formal definition above.

Exploration of structural properties of a gene-set multiplex network
High-level properties of a gene-set multiplex network We first explore the relations among the layers and subcollections of gene sets at a high level.One way to detect the inter-layer connections is to first cluster nodes in each layer of the multiplex network [41], and then use the normalized mutual information (NMI, sklearn.metrics.normalized_mu-tual_info_score)to quantify the resemblance of cluster structures, which in turn reflects their layer-wise similarities [15].The seaborn.clustermap displays the layer-wise similarities with a dendrogram showing the hierarchical clusters of layers under the default setting [42,43].A multiplex community is a collection of gene sets with connections across multiple layers.The Leiden algorithm can be generalized for community detection in a multiplex network.The quality function for a multiplex network is defined as a weighted sum of quality in each layer as where  is a partition of nodes that is the same for all layers and   is a weight for the -th layer that can be uniform.We deploy leidenalg.ModularityVertexPartition in Python for this task, and the optimization process is the same as that in a monoplex network [41].Another important example of a subcollection of gene sets is a family of gene ontology gene sets, including a parent gene set and its descendants.For example, GO-CC (cellular components) gene sets have a tree-like organizational structure that a directed acyclic graph can represent.The difference is that some gene sets may have more than one parent.The connections between two families of gene sets may change drastically in layers of the multiplex network compared to the knowledge-based Jaccard network.To quantify this difference, we propose a fold change of similarities as follows: where  is the number of layers in M; and   ,   are gene expression matrices corresponding to the gene sets ,  in the i th layer.Alternatively speaking, FC is the ratio between the average score of S KNN for all pairs of gene sets ,  in two families F , H across all layers of M and their average  score in the Jaccard network.When F , H are the same family of gene sets, we call it a within-family fold change.If the numerator and denominator are zero, we assign their FC as a missing value.If only the denominator is zero, FC is conservatively evaluated to be 1.To consider the overall context-over-knowledge difference in the connections between a family of gene sets F and other families, for example, the collections of level-3 GO-CC gene sets and their descendants, we define the between-family fold change as where Ω is a collection of families.The python package goatools is used in this task to search level-3 GO-CC gene sets and their descendants [44].
We use the Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) and graphical interface to visualize multiomics networks (Grimon) to visualize the distributions of gene sets.UMAP assumes that the data are distributed uniformly on a manifold, and computationally, it modifies the raw distance/dissimilarity matrix and embeds the newly constructed graph in the low-dimensional space via a force-directed graph layout algorithm [45].UMAP suits our needs as a fast and straightforward tool without displaying excessive information such as edges.Additionally, we apply Grimon to aggregate and adjust the UMAP coordinates of all the layers in the multiplex network for a unified view [46].

Structural coefficients of gene sets
At the scale of individual nodes (gene sets), several structural coefficients have been proposed for a multiplex network with unweighted and undirected layers [15,16], which can be generalized for multiplex networks with weighted (or directed) layers.The multiplex participation coefficient for node  in a multiplex network M with weighted layers can be ) where is the sum of weights for incident edges of a node  at layer  and   = ∑ ∈Y    .A large   implies that node  has consistent connections across layers.If   is as low as 0, node  has connectivity at a single layer [15].
The (local) clustering coefficient of node  quantifies the tendency of nodes in its (open) neighborhood to form connected links (together as triangles) in a network, which is initially proposed for unweighted and undirected graphs as , where A is an adjacency matrix and   = ∑ ≠ A  [47]. ) [51,52].
An extended version for a multiplex network can be developed by placing the edges of a triangle on two or three different layers, which correspond to coefficients  ,1 and  ,2 [15,16].Without loss of generality, we assume that G  is directed and weighted and has no self-loops for all  ∈ .The equations in the definitions can be transformed into matrix forms to facilitate vectorization and parallel processing.The proof is in the appendix.
We refer to a multiplex version of the PageRank coefficient, which reflects the global centrality of a node in a multiplex network, and implement it in Python [17].The algorithm attempts to integrate centrality information from layers iteratively.Suppose that the adjacency matrices has already been determined.Then, the multiplex PageRank centrality for node  in layers ( −1  )  );  is the Kronecker delta function; and  L  , ,  are scalar hyperparameters (defaults are respectively 0.85, 1, and 1 in our experiments) [17].The equation can be expressed more concisely as where M = A [L  ] , M is its column-normalized form (entries in a column remain zeros if divided by a zero column sum), and  is a diagonal matrix such that   = ( −1  )  .The base case is when  = 1 and [ 1  1 , ⋯ ,  1  ]  is defined to be the same as the PageRank centrality of a monoplex network L 1 [53].The numeric computation is through iterative procedures, and the final converged multiplex PageRank centrality for node  is   =    [17].We use the ratio of raw centrality to a uniform mass 1 number of gene sets to represent the centrality of nodes in this study.

Applications in gene set enrichment analysis
GSEA is a popular tool for interpreting the functional meaning of a target gene list.Given an ordered list of  weighted genes Ω = { 1 , ⋯ ,   }, for example, differentially expressed genes ranked by a metric for differences or adjusted p-values in hypothesis tests, the enrichment score (ES) for a gene set  is defined as follows: where ,  are positions in the ordered list of genes,   is the weight for the gene   , and  is a scalar hyperparameter [18].By permutations, a null distribution can be built.The proposal of the NES aims to reduce the effect of different sizes of gene sets in multiple-hypothesis testing by rescaling the enrichment score [18].For a gene set  with a highly positive or negative NES, the genes in  are unlikely to be randomly distributed in Ω, which implies the existence of their underlying connection.We use clusterProfiler (version 3.17) in R to conduct GSEA [26].
Practitioners often deal with many enriched gene sets in an experiment.A typical routine to determine the priority of enriched terms is to rank them by their (normalized) enrichment scores, gene overlap ratio (GeneRatio), or adjusted p-values, and only the gene sets with top ranks are emphasized.Those metrics mainly consider the effect size or statistical significance of the distributional deviation of genes.Some researchers may even pick their terms of interest, regardless of the ranks of those gene sets.However, a neurological term enriched in a study about gliomas may have a fundamentally different meaning if it were enriched in a rheumatology study.GSEA treats each gene set independently, and the interpretation of enriched terms does not explicitly consider contextual and relational attributes of gene sets, which the structural coefficients can supplement.
We use the multiplex clustering coefficients (C1 and C2) as indicators to find local clusters of gene sets with significant similarities by searching the "consistent" neighbors of gene sets with high-level clustering coefficients, for example, sharing the neighborhood over the layers (S KNN > 0.2).Then, rather than relying on the knowledge of gene sets' hierarchical organizations [8,21], local representatives of the groups are selected if they have maximal (local) multiplex PageRank centrality in the groups, which produces context-based representatives of the groups in a data-driven manner.The gene sets other than the representatives in the group are removed from the enriched terms.As many terms with high (global) multiplex centrality are gene sets too ambiguous for interpretation, we can concentrate on more specific terms with lower multiplex centrality.Regarding the multiplex participation coefficient, ( ) 2 ∈ [0, 1] (section 2.7) is a scalar value to reflect the biased "participation" of a gene set  in a particular layer .By examining gene sets' participation coefficients and per-layer participation, we know the specificity of their connectivities to particular layers.To combine them, we use multiplex clustering coefficients and PageRank centrality to reorganize the raw structures of enriched terms in GSEA to reduce redundancy and skip enriched terms with overly broad functions.Then, we can prioritize gene sets by both NES and multiplex centrality levels.The alternative way is to follow the orders of structural coefficients of gene sets, for example, multiplex participation coefficients, depending on the interest of researchers.We find that the multipartite graph is a valuable tool (networkx.multilayered_graph)to visualize the enriched terms ranked by different standards in a sensible way [54].

Computing performance of S KNN
As described in section 2.5, we first conduct a simulation study to test the ability of selected similarity measures to reflect various simulated relations (Fig. 2.A, B).For simulation dataset 1, each data point  has the same probability  −1 of being the KNN of another point , and the neighborhoods in the random matrices  and  are also independent.Thus, we can estimate the expectation of S KNN (,  ) as [ In this experiment,  = 31 ≈ √  , and we observe the scores of S KNN fluctuate on its expectation  −1 ≈ 0.03103 with a little variance no matter which metric is loaded (Fig. 2.A, upper left).In other words, for two completely random and independent datasets, S KNN will give a near-zero similarity score in the choice of a reasonable  (e.g.,

√ 𝑁
).The difference between simulation datasets 2 and 3 is that dataset 2 has no internal structures while dataset 3 has three separate clusters.S KNN smoothly captures the gradient of similarity between a data matrix and its submatrix as the number of columns increases.However, other similarity measures might be too sensitive to the global structures of datasets as their similarity scores approach 1 as soon as the submatrix recruits about 50 (over 300) columns of the raw data matrix (Fig. 2.A, upper right and lower left).Given a non-linear function, S KNN and other similarity measures can assign a non-trivial and consistent score to it regardless of the dimension of the input dataset, though different metrics embedded in S KNN may lead to divergent scores (Fig. 2.A, lower right).Fig.We then perform a running time comparison experiment on   ,   ,   , and S KNN with the default metric and .For an execution with 800 samples and 1000 target gene sets, S KNN outperforms   and   remarkably having more than one order of magnitude less time cost (10.4minutes versus 2.4 and 4.0 hours).  has an even lower time consumption (Fig. 2.C, left) but at the cost of worse performance.For simplicity of discussion, we assume that all  gene sets have the same size as , and the computational complexity of the multiplication of an  ×  matrix and its transpose is O( 2 ).Then, from section 2.4 and Table 1, the time complexities of   and   are O( 2 +  3 ) and O( 2 ), respectively.For computing the  ×  similarity matrix, their time complexities are O[( 2 +  3 ) 2 ] and O( 2  2 ), respectively.In contrast, the time complexity of computing the  index matrices (Fig. 1) is O( 2 ) by the brute algorithm of sklearn.neighbors.Computing S KNN based on a pair of index matrices (Fig. 1) would require the complexity of O().Thus, the total complexity for computing the similarity matrix is O( 2  +  2 ).The time complexity of   for computing the similarity matrix is O( 2 ) given a map of average gene expressions of genes as its input.
Regarding the performance, the measurements by S KNN ,   ,   are mutually associated and influenced by the sample size.With the measurement by   or   under the largest sample size as a reference (Fig. 2.C, middle right), the correlations between the reference and the measurements by   or   under different sample sizes gradually become stable as the sample size increases.Because the cells in this experiment are sampled from a homogenous population, a requirement for a larger sample size for the convergence of measurements is expected if they were sampled from a heterogeneous population.The correlations between the reference and measurements by S KNN lie between   and   .For example, compared to   , S KNN is closer to   and farther from   .The measurements by   have poor Pearson correlations with both methods   and   , which suggests that gene membership and expression levels alone cannot explain the interactions among gene sets.
We further explore the impact of  and metrics on the outcomes of S KNN .With an increasing number of  (< 1 2  ), S KNN has the tendency of approaching the non-linear association function   and deviating from the linear association function   with both metrics  2 and  (Fig. 2.D).S KNN loaded with the  2 metric is significantly closer to   than when loaded with the  metric (p-value = 0.002 by Wilcoxon signed-rank test).Oppositely, S KNN loaded with the  metric is significantly closer to   than when loaded with  2 (p-value = 0.002 by Wilcoxon signed-rank test).Thus, the hyperparameters  and metrics can adjust the inclination of S KNN towards a linear or non-linear similarity measure.But in general, a change of hyperparameter  does not abruptly switch the effect of measurement: for  in the interval around 100, the outcomes of S KNN correlate strongly (Fig. 2

.E).
Similarity measures like S KNN have some limitations.Unlike the deterministic semantic relations, the quality of gene expression datasets influences the computational results.For example, a scRNA-seq dataset with low coverage depth may have few detected genes for gene sets, resulting in a sparse count matrix that is unreliable in predicting the similarity.Due to the stochastic expression nature of genes, especially regulatory genes, the current scRNA-seq technology may not reliably capture their gene expression levels, which may underestimate their connectivity in the community [55].In addition, the current exclusion criteria may reject some "true negative" gene sets that are silent under specific contexts.

The multiplex network model can capture the biological variation of gene sets in context
We use S KNN with the default setting to compute similarities among gene sets in GO-CC and ImmuneSig.Five cell populations from human donors (EMBL-EBI) that include memory B cells in spleens, alveolar macrophages in lungs, natural killer cells in lung-draining lymph nodes, naive T cells in jejunum epithelium, and resident memory T cells in bone marrow set up the biological context for the measurements, corresponding to layers in the multiplex network (Fig. 3).We exclude gene sets that are too small or too large, namely, less than 10 or greater than 2000 genes in a gene expression matrix.To mitigate noise in the network, edges whose weights are lower than a positive threshold (e.g., 0.2) are discarded.We add an auxiliary network built on the Jaccard similarity matrix as a knowledge-based reference.
The layer-wise similarities are measured by NMI, and hierarchical clustering is performed (section 2.7).Layers for T cells and natural killer cells are clustered close to the layer for B cells.The layer for macrophages is on another branch of the hierarchical tree (Fig. 3.A).The hierarchical structure of layers in the multiplex network fits the model of hematopoiesis and the classification of peripheral agranular leukocytes in physiological states [56], which demonstrates the efficacy of multiplex networks for capturing a genuine biological variation.
In contrast, the knowledge-based Jaccard network has the least NMI with all layers in the multiplex network.In the UMAP space of the embedded Jaccard layer, GO-CC gene sets barely cover Immune-Sig gene sets.Conversely, these two groups of gene sets often significantly overlap in the UMAP space of embedded layers of the multiplex network (Fig. 3.D).The difference between a knowledge-based and a context-based network is prominent in this example.From a biological standpoint, this is reasonable, as gene sets in ImmuneSig rely on specific molecular machinery to conduct immunological or other elementary functions.Because GO-CC gene sets have a tree-like hierarchical structure, we are interested in how the difference between knowledgebased and context-based networks is decomposed into within-family and between-family portions (section 2.7).
The chosen families of gene sets are level-3 GO-CC gene sets and their descendants.Our finding (Fig. 3.C) shows that the between-family fold changes (median 2, range 0.86-16.7)are remarkably higher than within-family fold changes (median 0.99, range 0.8-1.23).For instance, the family GO:0070161 (anchoring junction) exhibits the highest ratio of these two metrics.It has a within-family fold change of 0.93, implying its within-family relations are, on average, similar to those found in the knowledge-based reference.In contrast, its between-family fold change with other families is 16.74, indicating that their contextual relations are quite different from those in the knowledge-based approach.A specific example is the family GO:0005681 (spliceosomal complex).The proximity of its family members to the GO:0070161 family in the UMAP space is illustrated in Fig. 3.D.In the Jaccard layer, these two families are separated by distance.But in the layers of the multiplex network, many of their descendant gene sets (e.g., GO:0014704 and GO:0071006) become variably close to each other.A biological explanation is that spliceosomes actively process pre-mRNA, and the spliced mRNA will impact the product of many molecules, including those related to the anchoring junction.For immune cells, the anchoring junction relates to, for example, cell migration, cell adherence, and antigen presentation.Thus, unsurprisingly, these two families of gene sets interact in context-based networks.Many of their family members aggregate in the embedded UMAP space (Fig. 3.D) in line with their within-family fold change.Thus, the model of multiplex networks of gene sets can provide insights into context-based gene-set relations that significantly differ from their knowledge-based counterpart.In contrast, their between-family fold changes can be much higher.Families GO:0070161 and GO:0005681 are parent gene sets.Their between-family relations change remarkably from the knowledge-based Jaccard layer to layers in the multiplex network, as visualized by the distributions of their family gene sets in the embedded UMAP space.Some of their descendant gene sets (marked by circles) become close in the multiplex network embedding.

Structural coefficients can provide multifaceted information for gene sets in a multiplex network
Now, we turn to another study on the multiplex network of GO-BP gene sets.The scRNA-seq dataset for computing S KNN is pbmc3k.final,which can be easily fetched by SeuratData.Cell populations for measurements include naive CD4 T cells, memory CD4 T cells, CD8 T cells, B cells, and CD14 monocytes (section 2.2).The threshold for truncating edge weights is relaxed to 0.1 to preserve more network variations, which results in low pairwise NMI scores among the layers in the multiplex network.The benefit is to reduce isolated nodes and include more gene sets in the measurements of their structural properties.Nevertheless, the clusters of T cell subtypes and the affinity between CD8 T cells and monocytes are observable in the dendrogram (Fig. 4.A).The knowledge-based Jaccard layer again shows the lowest layer-wise similarities to context-based layers in the multiplex network (Fig. 4.A).
Multiplex community detection reveals five major clusters after merging small groups (less than 200 over 5336 filtered gene sets) of gene sets into one extensive collection.The two most significant communities (blue and orange, communities 0 and 1 in Fig. 4.D) are visibly separated (Fig. 4.B, leftmost).Their primary difference is that the gene sets in community 0 are mainly large gene sets with weaker connections to other gene sets.In comparison, the gene sets in community 1 are usually small gene sets with stronger relations to others (Fig. 4.D, bottom left).Community 4 (purple) is an extensive collection merged from isolated small clusters of gene sets.Interestingly, members in community 4 are mainly small gene sets that possess high multiplex clustering coefficients (Fig. 4.D), indicating they are somehow borderline and tightly interconnected groups in the community.In the auxiliary Jaccard network, the partitions of multiplex communities disappear as they blend, which can be explained by the fundamental difference between a knowledge-based network and a context-based multiplex network.In summary, the multiplex communities in this experiment are primarily determined by their basic structural properties in the network, that is, gene set size and connections.
Several structural coefficients for individual gene sets can be computed, as described in section 2.7.They reflect the diversity of gene sets in their community from a structural perspective.They are concise summaries of gene sets' particular local, global, or dynamic features, as shown in Fig. 4.B (second from the left to the rightmost), which visualizes the patterns of three different structural coefficients in the multiplex network.We provide the code and data in the GitHub repository for these 3D visualizations.GO:0034368 (protein-lipid complex remodeling) is the gene set with the highest multiplex PageRank centrality (ratio of 30.03).The protein-lipid complex has been reported to be a positive regulator of immune cells [57].GO:0051240 (positive regulation of multicellular organismal process) is the gene set with the highest PageRank centrality (ratio of 2.74) in the Jaccard layer.Compared with GO:0034368, it gives less contextual information, and regardless of which cell population, it has the same PageRank centrality given the same collection of GO-BP gene sets.GO:0019835 (cytolysis) exhibits the lowest value for the participation coefficient, reflecting its biased connections in specific layers.Further investigation shows that the connections between GO:0019835 and other gene sets are significantly more active in the "B cells" layer.GO:0015867 (ATP transport) has the highest level of multiplex clustering coefficients (C2).Its clustering neighbors include several gene sets that are descendants of GO:0006862 (nucleotide transport), for example, GO:0015865 (purine nucleotide transport) and GO:0051503 (adenine nucleotide transport).Rarely, a gene set may occupy high ranks in more than two categories of structural coefficients (yellow in Fig. 4.C).
The association between two structural features of gene sets is generally complicated and non-monotonous (Fig. 4.D).However, we can notice some clear associations.The gene set size significantly influences the PageRank centrality of a gene set in the Jaccard network, as a disproportionate number of gene sets in community 0 (often large gene sets) have higher-than-average PageRank centrality in the Jaccard network, resulting in a Spearman correlation of 0.83 (Fig. 4.D), which is not valid for multiplex PageRank centrality because both small and large gene sets can have high centrality depending on the context (Fig. 4.D).Clustering coefficients C1 and C2 are associated significantly in this experiment.Large gene sets tend to have lower clustering coefficients than do small gene sets.However, their clustering coefficients are unrelated to centrality, as indicated by the two peaks in their co-plots (Fig. 4.C, D).The irrelevance illustrates the fundamental difference between centrality and clustering coefficients, as the former more often represents a dominant role in a hierarchical system.The latter reflects the tightness of inter-connections among members in the neighborhood of a gene set.

Using structural coefficients to reorganize and prioritize enriched terms in GSEA
The contextual and relational attributes of gene sets obtained by studying a relevant multiplex network can be transferred to GSEA.We run a typical GSEA on a list of differentially expressed genes.The casecontrol groups are memory CD4 T cells and other cell populations, including naive CD4 T cells, CD8 T cells, B cells, and CD14 monocytes in pbmc3k.final(Fig. 5.A), which are the same groups of cells used for computing similarities and building the multiplex network in section 3.3.The absolute NES values in descending order are used as the criteria to rank enriched GO-BP gene sets (Fig. 5.A).The structural coefficients can be retrieved from the multiplex network built in section 3.3.
The distribution of three structural coefficients of enriched terms in GSEA is displayed in Fig. 5.B.Gene sets with extreme scores in these structural measures, such as GO:0019886, can be easily identified.The UpSet plot visualizes the exclusive intersections among the 6 collections of gene sets that consist of the top 30 gene sets ranked by different standards (Fig. 5.C).A multipartite graph shows the detailed connections between these collections of high-ranking gene sets (Fig. 5.D).High-ranking gene sets identified by the two centrality coefficients or the two clustering coefficients have significant overlapping (Fig. 5.C-D).But for the two centrality coefficients, the multiplex PageRank centrality emphasizes context more; for example, the ranks of immune-related gene sets GO:0002682 and GO:0002684 increase compared to their positions determined by the PageRank centrality in the Jaccard network.Some noticeable gene sets appear in the lists of top NES values and other structural coefficients.For example, GO:0007155 and GO:0022610 (cell/biological adhesion) have upper-level multiplex PageRank coefficients and NES values, and GO:0002377 (immunoglobulin production) has a high NES and a low participation coefficient.Despite these unique findings, some problems are apparent in this raw arrangement of enriched terms.First, many top-ranked items by the NES have redundant information related to antigen processing and presentation (Fig. 5.A, D).Second, the gene sets with top centrality coefficients mainly consist of general terms that need to be more detailed for an in-depth interpretation.For instance, GO:0002684 (positive regulation of the immune system process) relates to the difference between memory T cells and other immune cells.Still, more specific information is needed for interpreting the functional meaning of differentially expressed genes.
To solve the first problem, we note that these gene sets also have high-ranking clustering coefficients, which suggests they are located in a densely interconnected region whose members share robust relationships across layers (Fig. 5.D), and a representative term alone may suffice to identify them.In other words, the groups of interconnected gene sets in a multiplex network are likely to share common functions in context and are eligible for being compressed into singular terms.By searching consistent neighbors, the first group of interconnected gene sets are found to include GO:0019886, GO:0002478, GO:0002495, GO:0002501, GO:0002503, GO:0002504, GO:0019884, GO:0019882, GO:0048002, GO:0002396, and GO:0002399.In the GO hierarchical   tree, GO:0002501, GO:0002503, GO:0002399, and GO:0002396 are descendant gene sets of GO:0065003 (protein-containing complex assembly), and GO:0019884, GO:0048002, GO:0002504, GO:0002478, and GO:0019886 are descendants of GO:0019882 (antigen processing and presentation).We then compute the (local) multiplex PageRank centrality for members in the group and find that GO:0002495 (antigen processing and presentation of peptide antigen via MHC class II) has the top centrality score.From a biological standpoint, the MHC class II mediates the antigen presentation for immune cells.Thus, GO:0002495 is a reasonable representative for this gene-set group in the multiplex net-work (Fig. 6.A).Then, the members in the group, except the representative gene set, are removed from the enriched terms.Similarly, we find another three clusters of gene sets, though not exhaustively.Each group of gene sets is represented by a particular gene set with the highest local centrality score (Fig. 6.B).Regarding the second problem, we can adjust the window of centrality to analyze enriched terms with lower multiplex centrality.Thus, we skip the top 15 items in the list of multiplex PageRank centrality, followed by many items with more specific information.The priority of enriched terms is assigned to gene sets that appear on the ranking lists of both NES and multiplex PageRank central-ity, such as GO:0002521 (leukocyte differentiation) and GO:0042110 (T cell activation) (Fig. 6.C), which are biologically more informative than the initial list of top enriched terms.We can gain additional insights into enriched terms by checking the gene sets' per-layer "participation" (section 2.8).For example, there is a general tendency of many gene sets for higher participation in the layer of CD8 T cells, for instance, GO:0045333 (cellular respiration) (Fig. 6.D).GO:0042100 (B cell proliferation) and GO:0002495 have more connectivities in the layer of B cells, which concords with the literature that B cells are classical antigen-presenting cells [58].GO:0042100 (wound healing) has a layerspecificity to the network of CD14 monocytes, and there is evidence to support the role of monocytes in wound healing [59].These examples suggest that a gene set may connect more with other gene sets in the cell populations where it functions more actively.
In summary, we propose a heuristic pipeline based on the relational attributes of gene sets to reorganize and rank gene sets (section 2.8): 1. Gene sets with high multiplex clustering coefficients (C1 and C2) are clues to searching densely interconnected groups of gene sets.Terms with maximum (local) multiplex centrality can represent other group members so that each group can be compressed into a singular item.2. Because gene sets with high multiplex PageRank centrality are often general terms with ambiguous meanings (i.e., difficult to interpret), we can adjust the scope to focus on enriched terms with lower multiplex centrality.However, centrality is a sensible measure for prioritizing gene sets.Thus, if an enriched term appears at the top of the NES rank list but has a low multiplex centrality, we can temporarily skip it to find terms with greater centrality.3. The participation coefficients can assist in a more comprehensive study of enriched terms, as they reflect the biased connectivity of gene sets in different contexts.
Nevertheless, the pipeline is arbitrary, and it may also make sense to follow the order of multiplex centrality or participation coefficients regardless of the NES values.Moreover, the knowledge-based structural coefficients can sometimes approximate the context-based multiplex network model results.But as we demonstrated in section 3.2, the multiplex network model can capture the biological variation of intergene-set relations that the knowledge-based model cannot.

Conclusion
In this study, we explored the new application of a classical KNNbased method to computing the similarity of gene sets.The simulation and running time comparison experiments demonstrated its superior computing performance compared to other similarity measures.Facilitated by S KNN , we built a multiplex network of gene sets to capture biological variation, which better reflects the dynamic and complex nature of gene-set interactions.The structural coefficients of gene sets extract relational characteristics of gene sets in their community, enabling a more thorough understanding of their roles.When researchers perform GSEA, they often have to make a decision on the criteria to prioritize enriched terms to identify the most important ones.Currently, the criteria are monotonous because only the effect size/significance of enrichment is emphasized and gene sets' contextual and relational features are often ignored.However, we can take advantage of these structural coefficients to identify a group of comprehensive criteria to reorganize and prioritize enriched terms, which may enhance the interpretability of GSEA.

Fig. 2 .
Fig. 2. Results of the simulation and comparison studies.(A) Upper left: Most similarity measures exhibit close-to-zero scores for a pair of independent datasets, except for  (exponent = 1).Upper right, lower left: S KNN captures the similarity gradient between a dataset and its subset with increasingly overlapped features.Lower right: The similarity corresponding to a non-linear function remains relatively stable with inputs of various dimensions.The widths of colored bands represent the range of scores in 50 repeated experiments.(B) Outcomes by different similarity measures for real gene sets.(C) The running time of   and   grows much faster than that of S KNN and   as the sample size or number of gene sets increases (left).The measurement by   does not correlate well with the outcomes of   (middle) or   (right), though it has the shortest running time.(D, E) The impact of hyperparameters  and metrics on the performance of S KNN .

Fig. 3 .
Fig. 3. Difference between the knowledge-based Jaccard network and the context-based multiplex network.(A) NMI scores indicate similarity between pairs of layers.The NMI scores and the hierarchical clustering show that the Jaccard layer differs significantly from all other layers in the multiplex network.(B) View of the distributions of GO-CC and ImmuneSig gene sets by Grimon.Each splice corresponds to a plot in D by its label.The coordinates of gene sets in the slices are UMAP coordinates adjusted by Grimon.There is a pronounced change in terms of the interactions between gene sets of GO-CC and ImmuneSig in the Jaccard network and layers in the multiplex network.(C, D) Most level-3 GO-CC gene sets and their descendants show within-family fold changes around 1.In contrast, their between-family fold changes can be much higher.Families GO:0070161 and GO:0005681 are parent gene sets.Their between-family relations change remarkably from the knowledge-based Jaccard layer to layers in the multiplex network, as visualized by the distributions of their family gene sets in the embedded UMAP space.Some of their descendant gene sets (marked by circles) become close in the multiplex network embedding.

Fig. 4 .
Fig. 4. Structural properties of a multiplex network of GO-BP gene sets.(A) NMI scores indicate layer-wise similarities.The dendrogram shows the hierarchical clustering of layers in the multiplex network.Upper: The threshold is 0.2 for filtering edges.Lower: The threshold is relaxed to 0.1 to preserve variations of intergene-set relations.(B) Visualizations of the multiplex communities and distributional patterns of structural coefficients.The coordinates of gene sets in each layer are UMAP coordinates adjusted by Grimon.Each layer corresponds to a cell population in the PBMC3K scRNAseq dataset, where the inter-gene-set similarities are measured.(C) The distribution of three primary structural coefficients of GO-BP gene sets.The ratios of centrality coefficients to the uniform mass ( 1 number of gene sets ) rather than raw coefficients are shown.Gene sets with higher than 90th percentile C2 and multiplex PageRank coefficients or those with lower than 10th percentile participation coefficients are marked by dark cyan; gene sets in more than one category are marked by yellow.(D) Even though the Spearman correlations among attributes of gene sets may be high, the relations between structural coefficients are generally complex and non-monotonous, as indicated by the locally weighted linear regression lines (red).

Fig. 5 .
Fig. 5. Application of a gene-set multiplex network to GSEA.(A) Left: GSEA experiment on a list of differentially expressed genes.Hundreds of genes are differentially expressed in the memory CD4 T cells (mostly downregulated) compared with other cell populations.Right: Top enriched terms ranked by NES.(B) Distribution of multiplex PageRank centrality, C2 clustering coefficients, and participation coefficients of enriched gene sets.(C) The UpSet plot shows the exclusive intersections among collections of the top 30 enriched gene sets ranked by NES, multiplex/Jaccard PageRank centrality, multiplex clustering coefficients C1 and C2, and participation coefficients.(D) Raw multipartite graph of enriched terms.Many of the top NES enriched gene sets, such as GO:0019886, also have upper-level multiplex clustering coefficients.Some unique gene sets, such as GO:0002684 and GO:0002377, appear at the top of lists by NES and the multiplex PageRank centrality or participation coefficient.However, the raw arrangement of enriched terms has some problems (section 2.8).The first is information redundancy among enriched terms with high clustering coefficients.Second, gene sets with high-level multiplex centrality, such as GO:0002684, are primarily general terms with ambiguous meanings.

Fig. 6 .
Fig. 6.Reorganizing and prioritizing enriched terms by structural coefficients.(A) One group of densely interconnected gene sets.The graph is the embedding of the average adjacency matrix of the multiplex layers.A lighter hue depicts the lines between gene sets with more minor similarities.The color of a node represents its multiplex PageRank centrality in the local network across layers.GO:0002495 exhibits the maximum multiplex centrality among the group members and thus is eligible to represent the group.(B) In the same way, three additional groups are identified, and the local centrality identifies their representative gene sets.The color of a node in this plot represents its multiplex PageRank centrality among all the groups.(C) Organization of enriched terms after compressing groups of enriched terms above and constraining the scope of multiplex centrality to a lower level.Finally, enriched terms are ranked by NES but skipped if they have low-level multiplex centrality.(D) The heatmap visualizes per-layer "participation" of enriched terms with the lowest participation coefficients.A general tendency towards more connectivity in the layer of CD8 T cells is evident.

Table 1
With similar notation, the empirical distance covariance ( in Table1) for  ∈ ℝ × and  ∈ ℝ × is defined as Functions to measure similarity/multivariate association.